Data website link: https://www.kaggle.com/spittman1248/cdc-data-nutrition-physical-activity-obesity
Data cleaning comment: The data set consists of 53392 observations and 16 attributes. The columns are filled with different data types such as integers, decimals, and characters. For data cleaning, I experimented using tibble because it is supposedly easier to work with large data. The select() function moves variables to a new variable, filter() function returns rows with matching conditions, and arrange() function sorts a variable in descending order.
library(tibble)
## Warning: package 'tibble' was built under R version 4.1.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.1.3
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
## Loading required package: lattice
library(class)
library(e1071)
CDC <- read.csv("cdc.csv")
# Basic R functions for data exploration
str(CDC)
## 'data.frame': 53392 obs. of 33 variables:
## $ YearStart : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ YearEnd : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ LocationAbbr : chr "AL" "AL" "AL" "AL" ...
## $ LocationDesc : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ Datasource : chr "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" ...
## $ Class : chr "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" ...
## $ Topic : chr "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" ...
## $ Question : chr "Percent of adults aged 18 years and older who have obesity" "Percent of adults aged 18 years and older who have obesity" "Percent of adults aged 18 years and older who have obesity" "Percent of adults aged 18 years and older who have obesity" ...
## $ Data_Value_Unit : logi NA NA NA NA NA NA ...
## $ Data_Value_Type : chr "Value" "Value" "Value" "Value" ...
## $ Data_Value : num 32 32.3 31.8 33.6 32.8 33.8 26.4 16.3 35.2 35.5 ...
## $ Data_Value_Alt : num 32 32.3 31.8 33.6 32.8 33.8 26.4 16.3 35.2 35.5 ...
## $ Data_Value_Footnote_Symbol: chr "" "" "" "" ...
## $ Data_Value_Footnote : chr "" "" "" "" ...
## $ Low_Confidence_Limit : num 30.5 29.9 30 29.9 30.2 31 23.7 12.6 30.7 31.6 ...
## $ High_Confidence_Limit : num 33.5 34.7 33.6 37.6 35.6 36.8 29.3 20.9 40 39.6 ...
## $ Sample_Size : int 7304 2581 4723 1153 2402 1925 1812 356 598 865 ...
## $ Total : chr "Total" "" "" "" ...
## $ Age.years. : chr "" "" "" "" ...
## $ Education : chr "" "" "" "Less than high school" ...
## $ Gender : chr "" "Male" "Female" "" ...
## $ Income : chr "" "" "" "" ...
## $ Race.Ethnicity : chr "" "" "" "" ...
## $ GeoLocation : chr "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" ...
## $ ClassID : chr "OWS" "OWS" "OWS" "OWS" ...
## $ TopicID : chr "OWS1" "OWS1" "OWS1" "OWS1" ...
## $ QuestionID : chr "Q036" "Q036" "Q036" "Q036" ...
## $ DataValueTypeID : chr "VALUE" "VALUE" "VALUE" "VALUE" ...
## $ LocationID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ StratificationCategory1 : chr "Total" "Gender" "Gender" "Education" ...
## $ Stratification1 : chr "Total" "Male" "Female" "Less than high school" ...
## $ StratificationCategoryId1 : chr "OVR" "GEN" "GEN" "EDU" ...
## $ StratificationID1 : chr "OVERALL" "MALE" "FEMALE" "EDUHS" ...
names(CDC)
## [1] "YearStart" "YearEnd"
## [3] "LocationAbbr" "LocationDesc"
## [5] "Datasource" "Class"
## [7] "Topic" "Question"
## [9] "Data_Value_Unit" "Data_Value_Type"
## [11] "Data_Value" "Data_Value_Alt"
## [13] "Data_Value_Footnote_Symbol" "Data_Value_Footnote"
## [15] "Low_Confidence_Limit" "High_Confidence_Limit"
## [17] "Sample_Size" "Total"
## [19] "Age.years." "Education"
## [21] "Gender" "Income"
## [23] "Race.Ethnicity" "GeoLocation"
## [25] "ClassID" "TopicID"
## [27] "QuestionID" "DataValueTypeID"
## [29] "LocationID" "StratificationCategory1"
## [31] "Stratification1" "StratificationCategoryId1"
## [33] "StratificationID1"
summary(CDC)
## YearStart YearEnd LocationAbbr LocationDesc
## Min. :2011 Min. :2011 Length:53392 Length:53392
## 1st Qu.:2012 1st Qu.:2012 Class :character Class :character
## Median :2013 Median :2013 Mode :character Mode :character
## Mean :2013 Mean :2013
## 3rd Qu.:2015 3rd Qu.:2015
## Max. :2016 Max. :2016
##
## Datasource Class Topic Question
## Length:53392 Length:53392 Length:53392 Length:53392
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Data_Value_Unit Data_Value_Type Data_Value Data_Value_Alt
## Mode:logical Length:53392 Min. : 0.90 Min. : 0.90
## NA's:53392 Class :character 1st Qu.:24.10 1st Qu.:24.10
## Mode :character Median :30.70 Median :30.70
## Mean :31.16 Mean :31.16
## 3rd Qu.:37.00 3rd Qu.:37.00
## Max. :77.60 Max. :77.60
## NA's :5046 NA's :5046
## Data_Value_Footnote_Symbol Data_Value_Footnote Low_Confidence_Limit
## Length:53392 Length:53392 Min. : 0.30
## Class :character Class :character 1st Qu.:20.00
## Mode :character Mode :character Median :26.45
## Mean :26.89
## 3rd Qu.:32.90
## Max. :69.50
## NA's :5046
## High_Confidence_Limit Sample_Size Total Age.years.
## Min. : 3.00 Min. : 50 Length:53392 Length:53392
## 1st Qu.:28.20 1st Qu.: 566 Class :character Class :character
## Median :35.60 Median : 1209 Mode :character Mode :character
## Mean :35.99 Mean : 3889
## 3rd Qu.:42.20 3rd Qu.: 2519
## Max. :87.70 Max. :476876
## NA's :5046 NA's :5046
## Education Gender Income Race.Ethnicity
## Length:53392 Length:53392 Length:53392 Length:53392
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## GeoLocation ClassID TopicID QuestionID
## Length:53392 Length:53392 Length:53392 Length:53392
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## DataValueTypeID LocationID StratificationCategory1 Stratification1
## Length:53392 Min. : 1.00 Length:53392 Length:53392
## Class :character 1st Qu.:17.00 Class :character Class :character
## Mode :character Median :30.00 Mode :character Mode :character
## Mean :30.28
## 3rd Qu.:44.00
## Max. :78.00
##
## StratificationCategoryId1 StratificationID1
## Length:53392 Length:53392
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
head(CDC)
tail(CDC)
CDC2 <- select(CDC, YearEnd, LocationDesc, LocationAbbr, Question, Sample_Size, Data_Value, Low_Confidence_Limit, Education, Gender, Income,`Age.years.`,`Race.Ethnicity`, Stratification1, StratificationCategoryId1, StratificationID1, GeoLocation, Sample_Size)
as_tibble(CDC2)
CDC_overweight <- select(CDC2, YearEnd, LocationDesc, LocationAbbr, Question, Sample_Size, Data_Value, Stratification1)
CDC_overweight <- filter(CDC_overweight, Question == "Percent of adults aged 18 years and older who have an overweight classification", LocationAbbr == "US", Stratification1 == "Total", LocationAbbr != "PR", LocationAbbr != "GU", LocationAbbr !="VI")
CDC_obese <- select(CDC2, YearEnd, LocationDesc, LocationAbbr, Question, Sample_Size, Data_Value, Stratification1)
CDC_obese <- filter(CDC_obese, Question == "Percent of adults aged 18 years and older who have obesity", LocationAbbr == "US", Stratification1 == "Total", LocationAbbr != "PR", LocationAbbr != "GU", LocationAbbr !="VI")
arrange(CDC_overweight, YearEnd)
arrange(CDC_obese, YearEnd)
Graphs comment: According to the graph, overweight rates have decreased until 2014, which infers that Americans are losing weight annually. However, the obesity rates have been increasing throughout the years, so Americans are gaining weight. In other words, people classified as overweight are transitioning to obese.
overweight_plot <- ggplot(data=CDC_overweight, aes(x = YearEnd, y = Data_Value, group=1)) +
geom_line(color="red")+
geom_point()+labs(title = "2011-2016 US Overweight Rates", x = "Year", y = "Overweight Rate(%)")
obese_plot <- ggplot(data=CDC_obese, aes(x = YearEnd, y = Data_Value, group=1)) +
geom_line(color="blue")+
geom_point()+labs(title = "2011-2016 US Obesity Rates", x = "Year", y = "Obesity Rate(%)")
grid.arrange(arrangeGrob(obese_plot, overweight_plot, ncol = 2))
Linear regression comment: I decided to choose these features because it seemed logical to assume that there would be a positive relationship between adults not consuming vegetables and the obesity rate. As a result, I wanted to affirm that my hypothesis was correct. The linear regression verifies that not eating vegetables could lead to an increase in obesity. Obs_rate = 0.04512 * Veg + 28.34208. As the percentage of not eating vegetables increases, so will the obesity rate. The RSE is 7.143, making the percentage error (7.143/50.81648) * 100 = 14.056. The RMSE states that our test data is off by a 7.128568 obesity rate on average. Everything looks decent except the fact that the p-value is not very small.
Veg <- select(CDC2, Question, Data_Value)
Veg <- filter(Veg, Question=="Percent of adults who report consuming vegetables less than one time daily")
Veg <- select(Veg, Data_Value)
Obs_rate <- select(CDC2, Question, Data_Value)
Obs_rate <- filter(Obs_rate, Question == "Percent of adults aged 18 years and older who have obesity")
Obs_rate <- select(Obs_rate, Data_Value)
Veg <- na.omit(Veg)
Obs_rate <- na.omit(Obs_rate)
Veg <- unlist(Veg)
Obs_rate <- unlist(Obs_rate)
Veg <- Veg[-c(500:3996)]
Obs_rate <- Obs_rate[-c(500:8127)]
beg <- proc.time()
lm1 <- lm(Obs_rate~Veg)
end <- proc.time()
summary(lm1)
##
## Call:
## lm(formula = Obs_rate ~ Veg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.7611 -3.8328 0.1856 4.7247 24.7887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.34208 1.07104 26.462 <2e-16 ***
## Veg 0.04512 0.04409 1.023 0.307
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.143 on 497 degrees of freedom
## Multiple R-squared: 0.002103, Adjusted R-squared: 9.5e-05
## F-statistic: 1.047 on 1 and 497 DF, p-value: 0.3066
mse1 <- mean(lm1$residuals^2)
mse1
## [1] 50.81648
rmse1 <- sqrt(mse1)
rmse1
## [1] 7.128568
time <- end - beg
time
## user system elapsed
## 0.02 0.00 0.01
plot(Veg, Obs_rate, main = "Scatterplot of Adults Not Consuming Vegatables Daily vs Obesity Rate", xlab = "Adults Not Consuming Vegatables Daily (%)", ylab = "Obesity Rate (%)", col = "black")
abline(lm1, col = "blue")
kNN regression comment: I would think the percentage of people not eating vegetables and not exercising would affect the obesity rate. However, the accuracy is not high even with the most optimal k-value, so we can assume that kNN might not be the right approach here. The time performance is a bit slower than linear regression as well. Another reason is that the data wasn’t correctly scaled. A solution to find the best k is to use a for loop and traverse the kNN function within it. I manually changed the k value because the range for my k is not substantial.
Phy <- select(CDC2, Question, Data_Value)
Phy <- filter(Phy, Question == "Percent of adults who engage in no leisure-time physical activity")
Phy <- select(Phy, Data_Value)
Phy <- na.omit(Phy)
beg <- proc.time()
knn_pred <- knn(Veg, Phy, Obs_rate, k = 3)
end <- proc.time()
knn_pred <- knn_pred[-c(500:8127)]
knn_pred <- as.numeric(knn_pred)
mean(knn_pred == Obs_rate)
## [1] 0
time <- end - beg
time
## user system elapsed
## 0.02 0.00 0.01
SVM regression comment: I made gender a factor related to the year, sample size, data value, and confidence limits. I chose gender to be a factor because it is a binary value where (0 equals female and 1 equals male) so it can provide some valuable data combined with other data elements. The runtime is two milliseconds slower than kNN. Nonetheless, the accuracy is 88.4%, so the results suggest that our model is a good classifier for the data.
CDC3 <- select(CDC, YearEnd, Sample_Size, Data_Value, Low_Confidence_Limit, High_Confidence_Limit, Gender)
CDC3$Gender <- as.factor(CDC3$Gender)
CDC3 <- CDC3[-c(500:nrow(CDC3)), ]
CDC3 <- CDC3[!apply(CDC3 == "", 1, all), ]
CDC3 <- na.omit(CDC3)
set.seed(1234)
i <- sample(1:nrow(CDC3), 0.75*nrow(CDC3), replace=FALSE)
train <- CDC3[i,]
test <- CDC3[-i,]
beg <- proc.time()
svm1 <- svm(Gender~ ., data = train, kernel = "linear", cost = 10, scale = TRUE)
end <- proc.time()
summary(svm1)
##
## Call:
## svm(formula = Gender ~ ., data = train, kernel = "linear", cost = 10,
## scale = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
##
## Number of Support Vectors: 49
##
## ( 16 19 14 )
##
##
## Number of Classes: 3
##
## Levels:
## Female Male
pred <- predict(svm1, newdata = test)
table(pred == test$Gender)
##
## FALSE TRUE
## 13 99
mean(pred == test$Gender)
## [1] 0.8839286
time <- end - beg
time
## user system elapsed
## 0.03 0.00 0.03
Results analysis: The processing order goes from linear regression, kNN, and SVM, where linear regression is the fastest. Linear regression is interpretable and straightforward but sensitive to outliers. kNN does not assume the shape of the data compared to linear regression, but extra steps like scaling the data are necessary to get good results. SVM deals with complex calculations and decision boundaries, so computing results prove to be stressful. Data doesn’t show a correlation between not eating vegetables and not exercising with obesity prevalence. However, there could be some computation error or data faults. This would make sense because the energy obtained from eating foods is not expended. Thus, logically speaking, not eating vegetables and not exercising would correlate to the obesity rate in some way or another.